Note: This page's design, presentation and content have been created and enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 8 • Sub-Lesson 2

📊 AI and Scientific Images

Charts, figures, microscopy, and the gap between impressive-looking outputs and reliable scientific image analysis

The Most Important Lesson of the Week

The headline finding from the CharXiv benchmark (2024): on real scientific charts from actual published papers, the best model at publication achieved 47.1% on reasoning questions versus 80.5% for humans. Frontier model scores have improved since, but the benchmark has been updated to keep pace — and the core lesson stands: real-world performance on actual scientific figures consistently and substantially lags performance on the simplified benchmarks used in marketing claims.

But that finding coexists with models that genuinely help with certain scientific image tasks. The 47% problem does not mean “never use AI with images.” It means you need to be precise about which tasks AI does reliably and which it does not.

This sub-lesson covers: what AI can genuinely help with, the CharXiv finding and its root cause, the correct-answer-wrong-reasoning problem, domain-specific vs. general-purpose tools, demographic bias in image AI, and a practical five-step workflow for using AI with scientific images safely.

  • 47.1%: GPT-4o on CharXiv at launch (2024 baseline — frontier models have since improved, but real-world gaps persist)
  • 80.5%: Human performance on the same CharXiv reasoning questions (2024 benchmark)
  • 34.5pp: Benchmark-to-real-chart drop for open-source models
  • 58.07%: Average model score on simple geometric reasoning tasks (VLMs Are Blind)
  • 35.5%: Correct medical image answers where GPT-4V's visual reasoning was demonstrably flawed

✓ What AI Can Genuinely Help With

The 47% problem should make you cautious about quantitative reading of charts — it should not make you dismiss AI-assisted image work entirely. There is a meaningful class of image tasks where current multimodal AI performs reliably and adds genuine value to research workflows.

📝 Accessibility and Documentation

Generating alt text and descriptive figure captions is one of the most reliable uses of multimodal AI. The model produces fluent natural-language descriptions of visual content — which is exactly where its language-side strength applies.

  • Alt text generation for journal figures
  • Figure captions from a visual description
  • Descriptive documentation of microscopy images
  • Labelling and cataloguing image datasets
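A minimal sketch of what this looks like in practice, using the Anthropic Python SDK (any multimodal API follows the same pattern). The model name and file path are illustrative placeholders, and the prompt deliberately keeps the request in the description zone:

```python
# Minimal alt-text sketch. Assumes the Anthropic Python SDK and an API key in
# the environment; the model name and file path are illustrative placeholders.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("figure_3.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative; substitute a current model name
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": ("Write concise alt text for this figure. Describe the chart "
                      "type and the main visual pattern, but do not state specific "
                      "numeric values.")},
        ],
    }],
)
print(message.content[0].text)
```

Note that the prompt explicitly tells the model not to report numeric values: the output is a description to be checked and edited, not a source of data.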

📈 Qualitative Chart Interpretation

Asking AI to explain what a figure shows in plain language — without relying on it to read specific values — is generally reliable. Models can identify chart types, describe main trends, and summarise the stated conclusion of a figure.

  • “This figure shows a positive trend between X and Y”
  • Identifying the type of visualisation
  • Describing the main visual pattern or trend
  • Summarising the figure's stated purpose or title

📷 Visual Comparison

Comparing two images for obvious, large-scale visual differences is within the reliable range. This includes before/after pairs, comparisons between specimens, and identification of gross structural differences between two images.

  • Before/after comparisons (treatment vs. control)
  • Identifying large structural differences between specimens
  • Describing obvious changes between two versions of a figure
  • Spotting gross formatting or layout issues in a figure

🔧 Figure Design Feedback

AI can provide useful qualitative feedback on chart design choices: whether a colour palette is likely to cause accessibility issues, whether a chart type suits the data, whether labels are readable. This is a description task, not a reading task.

  • Accessibility checks (colour, contrast)
  • Chart type appropriateness for the data shown
  • Label clarity and axis readability
  • Consistency of design across a figure set

🔍 Image Classification (Qualitative)

Broad categorical classification — identifying what type of specimen or structure is shown, flagging an image as likely belonging to a category — is more reliable than precise measurement or quantitative reading.

  • Identifying the general category of a biological specimen
  • Classifying image type (bar chart, scatter plot, micrograph)
  • Flagging images that appear anomalous for human review
  • Tagging images by content for search and retrieval

✏️ Writing Support for Figures

Given a description of what a figure shows, AI can draft the figure caption, suggest legend text, or help revise the written description of results that reference a figure. The model works from the text you provide — not by independently reading the image.

  • Drafting figure captions from a description you write
  • Revising results-section text about a figure
  • Suggesting more precise language for visual descriptions
  • Checking that caption text matches the description you provided

💡 The Reliable Zone: Description Tasks

A useful heuristic: AI is most reliable when the task is a description task (what does this image show, qualitatively?) rather than a value-extraction task (what is the exact value at this point on the chart?). Keep AI in the description zone and verify any specific quantitative claims against the underlying data.
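To make the heuristic concrete, here is an illustrative pair of prompts for the same figure; the wording is hypothetical, but the distinction is the one described above:

```python
# Two illustrative prompts for the same figure. The first stays in the
# description zone; the second is a value-extraction request whose answer
# should be checked against the underlying data rather than trusted.
DESCRIPTION_PROMPT = (
    "Describe this figure qualitatively: what type of chart is it, and what is "
    "the main trend or relationship it shows? Do not report exact values."
)
EXTRACTION_PROMPT = (
    "What is the exact y-value of the treatment group at week 12?"
)
```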

📊 The CharXiv Finding — When Benchmarks Lie

📊 Case Study: CharXiv — NeurIPS 2024

The CharXiv benchmark was constructed by researchers at Princeton University specifically to address a known flaw in prior chart-understanding benchmarks: they used simplified, purpose-built test charts rather than charts from real scientific papers. Real scientific charts are more complex, more varied in design, and more likely to require genuine reasoning rather than pattern-matching to common formats.

Dataset: 2,323 diverse charts extracted from actual papers on arXiv, spanning 16 scientific domains including mathematics, physics, computer science, and biology.

Question types: Two categories — descriptive questions (what is shown) and reasoning questions (what does the data mean, what can be inferred). The gap between AI and human performance is much larger on reasoning questions.

Note: The scores below are 2024 baselines from the original CharXiv paper. Frontier model performance on the benchmark has improved substantially since publication — the benchmark has also been updated to maintain difficulty. The pedagogical point (real-world performance lags simplified benchmark scores) remains well-supported.

Model | CharXiv Reasoning Score | Standard Benchmark Score (ChartQA / AI2D) | Drop
GPT-4o | 47.1% | ~85–90%+ | ~38–43pp
Claude 3.5 Sonnet | ~60% | 94.7% (AI2D), 90.8% (ChartQA) | ~30–35pp
Open-source leaders | Varies | Often 80–90%+ on standard benchmarks | Up to 34.5pp
Human baseline | 80.5% | N/A | N/A

Why does this happen? The companion paper “Vision Language Models Are Blind” (Rahmanzadehgervi et al., ACCV 2024) tested models on tasks that should be trivial for any visual system: do two circles overlap? how many times do two lines intersect? Models averaged 58.07% on these tasks — close to chance on some sub-tasks. The models are reading the axis labels, title text, and captions — the language surrounding the chart — and using that text to generate plausible descriptions. They are not processing the geometry of the chart itself.

The practical implication: Never trust AI-extracted numerical values from charts without verification against the underlying data. The model's description of what a chart shows may sound authoritative while being numerically wrong. Frontier model scores on CharXiv have improved substantially since 2024, but evaluations of newer models continue to show gaps between performance on real scientific figures and performance on simplified benchmarks — the headline gap has narrowed, but the underlying pattern is durable.

The Benchmark Paradox

A model that scores 94%+ on AI2D but substantially less on CharXiv is not dishonestly marketed. It is a genuine reflection of the fact that AI2D tests charts that look like AI2D charts — standardised, simple, purpose-built for evaluation. CharXiv tests charts that look like real published science — complex, varied, and requiring genuine reasoning. This pattern holds across benchmark generations: as models improve, the gap between simplified and real-world benchmarks shrinks more slowly than headline scores suggest.

When you read a press release or product page claiming high accuracy on chart understanding, the first question should be: what benchmark, and how does that benchmark compare to the images I actually need to analyse? The benchmark cited almost always favours the marketing claim; CharXiv-style real-world performance is usually not in the headline.

This is the epistemological problem that runs through the entire week: AI performance in research contexts is almost always lower than AI performance on the benchmarks used to market the model.

⚠️ The Correct Answer, Wrong Reasoning Problem

The CharXiv finding shows that AI often gets scientific chart questions wrong. A separate and more insidious problem arises when AI gets the right answer — but for the wrong reason. This is best documented in medical image analysis, but the pattern generalises.

⚠️ GPT-4V in Clinical Image Challenges: Right Answer, Wrong Reasoning

Source: Jin et al., npj Digital Medicine, July 2024. “Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine.”

The headline finding: GPT-4V achieved 81.6% accuracy on NEJM Image Challenges vs. 77.8% for human physicians. On the face of it, this looks like AI exceeding human performance — a capability milestone.

The hidden problem: In 35.5% of correct answers, the visual reasoning rationale provided by the model was demonstrably flawed. The image comprehension error rate was approximately 27%. The model was sometimes arriving at correct diagnoses while citing visual features that were not actually present in the image — effectively guessing correctly and then confabulating a justification.

Why this matters beyond medicine: You cannot detect this problem by looking at accuracy metrics alone. You have to read the reasoning. A model that explains a figure convincingly but incorrectly will mislead reviewers, supervisors, and students — especially students who do not yet have the domain knowledge to recognise that the explanation is wrong. The model's confident, fluent prose creates the impression of genuine understanding where there is none.

The practical consequence: When using AI to interpret a scientific image, ask it to explain its reasoning — and then check whether that reasoning actually describes what is in the image. The answer may be right. The reasoning may be fabricated. You need to verify both separately.

Why Fluency Is Not a Signal of Accuracy

Modern language models are trained to produce fluent, coherent, contextually appropriate text. This training objective is completely independent of whether the content is factually accurate. A model describing a scientific image is optimising for a description that reads plausibly given the image and its context — not one that is pixel-accurately correct.

This means the most fluent, confident-sounding description of a figure is not necessarily the most accurate one. In practice, models are often most confidently wrong when they are in familiar territory — a bar chart that looks like the many bar charts in their training data, where pattern-matching is strong enough to produce a plausible answer without engaging with the specific chart at all.

The corollary: when a model hedges, expresses uncertainty, or says it cannot read a specific value, that is actually a more trustworthy signal than confident fluency. Calibrated uncertainty is a feature, not a failure.

🎯 Domain-Specific Image AI

General-purpose LLMs (Claude, GPT, Gemini) are not the only class of image AI available to researchers. In several scientific domains, purpose-built models trained specifically on domain data substantially outperform general-purpose tools on well-defined classification tasks.

💊 Medical and Microscopy Imaging

Domain-trained CNN and vision-transformer architectures report 97–99% accuracy on the standard NIH thin-blood-smear malaria cell-image dataset (Mujahid et al., 2024, Scientific Reports; CNN–ViT ensembles reach ~99.6% in 2025 work). These are curated benchmark figures, not field-deployment accuracy, and have not been formally compared against expert pathologist performance on the same images. Real-world clinical accuracy is typically lower.

General-purpose LLMs are not substitutes for these systems in diagnostic or clinical research contexts. As the Journal of Nuclear Medicine DREAM Report (2025) documents, applying general-purpose language models to domain-specific image interpretation produces systematic hallucination failures that are qualitatively different from those seen in general text tasks.

  • Use domain-trained models for classification tasks
  • Use general-purpose LLMs for description and documentation
  • Never use LLMs as a substitute for clinical image analysis
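As an illustration of the division of labour above, here is a minimal fine-tuning sketch for the cell-classification case, assuming the NIH malaria images are arranged as one folder per class. It illustrates the general approach of a domain-trained classifier, not the pipeline from Mujahid et al. (2024); paths, model choice, and hyperparameters are placeholders.

```python
# Minimal sketch: fine-tune a small CNN for parasitized / uninfected cell
# classification. Illustrative only; paths and hyperparameters are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Assumes images arranged as cell_images/train/Parasitized and .../Uninfected
train_ds = datasets.ImageFolder("cell_images/train", transform=tfm)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: parasitized / uninfected

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:  # one illustrative epoch
    optimiser.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimiser.step()
```

The point of the sketch is the contrast: this kind of purpose-trained classifier handles the precision task, while a general-purpose LLM would only be asked to describe or document the results.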

🌏 Satellite and Geospatial Imagery

This is one of the most active areas of AI for science, with a layered ecosystem worth understanding before picking a tool.

Pretrained foundation models for Earth observation — weights trained on satellite data, used as backbones for downstream tasks. Prithvi / Prithvi-EO-2.0 (IBM/NASA, on Harmonized Landsat Sentinel data) is the most established and the most widely used in research; Clay (ClayFM, 2024) is a generalist Earth-observation foundation model with growing adoption; SatMAE (Stanford, NeurIPS 2022) and Scale-MAE (2023) are masked-autoencoder approaches; SatCLIP (Microsoft/ETH, 2024) aligns satellite imagery with text in CLIP style. These are backbones — you load the weights through a framework rather than running them directly.

Frameworks to actually use them — the Python libraries that handle data loading, fine-tuning, and inference. TerraTorch (IBM) is purpose-built for Prithvi and other geospatial foundation models and is the closest thing to a research-ready default. torchgeo (Microsoft) is the established PyTorch library for geospatial datasets and models. geoai-py (Wu, JOSS 2026) is a broader Python toolkit aimed at lowering the barrier to entry.

Cloud APIs — Google's Geospatial Reasoning with Gemini applies a general-purpose multimodal model to geospatial data. Useful for description, qualitative reasoning, and rapid prototyping; less reliable for precise classification than purpose-trained satellite models.

  • Where to start: TerraTorch + Prithvi-EO-2.0 for research-grade Earth observation work
  • Quick exploration: Gemini Geospatial Reasoning
  • Caveat: the field is moving fast — check current benchmarks before committing
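For orientation, a minimal sketch of the framework route: loading a standard Earth-observation benchmark through torchgeo. The dataset choice, root path, and arguments are illustrative, and torchgeo's API evolves quickly, so confirm against the version you have installed.

```python
# Minimal torchgeo sketch: load the EuroSAT benchmark and iterate batches.
# Dataset choice and arguments are illustrative; check the current torchgeo docs.
from torch.utils.data import DataLoader
from torchgeo.datasets import EuroSAT

train_ds = EuroSAT(root="data/eurosat", split="train", download=True)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

batch = next(iter(train_loader))
# torchgeo datasets return dictionaries; "image" is a multispectral tensor
print(batch["image"].shape, batch["label"].shape)
```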

🔮 The General Rule

If your field has domain-trained models available, use them for precision tasks — classification, detection, measurement. Use general-purpose LLMs for description, documentation, and qualitative interpretation.

The two categories complement each other. A domain-trained model can classify cells with high accuracy on a curated benchmark; a general-purpose LLM can explain what the classification means to a non-specialist, draft the methods section describing how the model was used, or help write the figure caption. Neither tool does everything.

  • Precision tasks → domain-trained models
  • Description and communication → general LLMs
  • They are not competitors; they are complementary

⚖️ Bias in Image Recognition

Researchers whose image data involves human subjects need to be aware of systematic bias in image AI systems. The disparities documented in this area are large — large enough to invalidate research conclusions if not accounted for.

⚠️ Demographic Disparities in Facial Recognition

The MIT Media Lab Gender Shades study (Buolamwini & Gebru, 2018) documented gender-classification error rates of approximately 0.8% for light-skinned men vs. approximately 34.7% for dark-skinned women on commercial systems of that era — a roughly 43-fold gap. Note that this was a gender-classification audit on 2017-vintage APIs, not modern face recognition.

More current evidence comes from NIST's Face Recognition Vendor Test (FRVT) demographic evaluations (NISTIR 8429; pages.nist.gov/frvt/html/frvt_demographics.html): in 1:1 verification, false-positive rates for African and East Asian subjects are commonly 10–100× higher than for White subjects, while false-negative rates differ by up to roughly a factor of three. Differentials vary substantially by algorithm, image quality, and dynamic range — and the best-performing systems have narrowed these gaps considerably since 2018, though they have not closed them.

For UCT researchers: If you are using AI to analyse images of people — in social science, public health, education research, community health — do not assume accuracy from published benchmarks, which frequently underrepresent darker-skinned subjects. Test accuracy across demographic groups on your specific dataset. This is not a hypothetical concern at a South African university.

The practical step: Before applying any image AI system to a dataset involving human subjects, run it on a stratified subsample where you know the ground truth, broken down by relevant demographic variables. Report the stratified accuracy in your methods section.
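A minimal pandas sketch of that stratified check, assuming one row per image with the model's prediction, the known ground truth, and the demographic group; the column names and file name are illustrative.

```python
# Stratified accuracy check on a labelled subsample. Column names are illustrative.
import pandas as pd

df = pd.read_csv("labelled_subsample.csv")  # columns: prediction, ground_truth, group

df["correct"] = df["prediction"] == df["ground_truth"]
stratified = (
    df.groupby("group")["correct"]
      .agg(n="size", accuracy="mean")
      .sort_values("accuracy")
)
print(stratified)                            # report these per-group figures
print("Overall accuracy:", df["correct"].mean())  # not just this aggregate number
```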

⚠️ AI-Generated Images of STEM Professionals

Studies published in 2025 (Scientific Reports) documented significant racial and gender homogenisation in AI-generated images of scientists. When prompted to generate images of researchers, scientists, or STEM professionals, major image generation models disproportionately produce images of white, male, older individuals — reinforcing rather than challenging existing stereotypes about who belongs in science.

For researchers producing educational materials, public communication, or visual content about their field: be explicit about prompting for diversity, and audit outputs before publication. The default outputs of image generation models are not demographically neutral.

For researchers studying representation in science communication: these tools are generating real-world data you can study. The bias is consistent, measurable, and varies across models and prompt formulations.

💡 Broader Bias Principle

The facial recognition and image generation findings are the most documented, but the underlying principle is general: AI image systems reflect the demographic distribution of their training data. In any domain where your research population differs from the typical composition of large image datasets (which are often heavily US- and Europe-weighted), you should explicitly test for differential accuracy before trusting aggregate benchmark scores.

⚙️ A Practical Workflow for Scientific Images

Given the limitations documented above, here is a five-step workflow for using AI with scientific images in research. Each step is designed to catch a specific failure mode before it propagates into your analysis.

  1. Use AI for description and qualitative interpretation — not numerical reading. Ask the model to explain what the figure shows in plain language: what type of chart is it, what is the main trend or pattern, what is the stated relationship between variables. Stay in the description zone. Do not ask it to read specific values from axes or data points.
  2. Never trust AI-extracted numbers without verification. If your analysis requires specific values from a chart — data points, y-axis values, slopes, percentages — go back to the underlying data, the figure's data table if one is provided, or the paper's supplementary materials. Treat any number the AI reads from a chart as a hypothesis to be checked, not a fact to be used.
  3. Read the reasoning, not just the answer. Ask the AI to explain how it reached its interpretation of a figure. Check whether that reasoning actually describes what is visually present in the image. If the model cites visual features that are not there, the answer is unreliable even if it sounds plausible. This is the direct mitigation for the Jin et al. finding: verification requires reading the explanation, not just the conclusion.
  4. Use domain-trained tools for precision tasks. If your field has specialist image AI — clinical imaging tools, satellite analysis platforms, microscopy classification systems — use those for any task that requires measurement, classification, or detection. General-purpose LLMs are not substitutes for tools trained specifically on your image type.
  5. Test on images where you know the answer before scaling up. Before applying AI image analysis to your full dataset, run it on a subset where you have ground truth — either from prior labelling, independent expert review, or the underlying data. Measure accuracy on that subsample. Only scale up if accuracy on the subsample is acceptable for your research context. This step will catch most CharXiv-style failures before they become research errors.
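A minimal sketch of step 5, piloting the tool on a labelled subsample before scaling up. Here classify_image is a stand-in for whatever AI system you are evaluating (an API call or a local model); the labels, file names, and acceptance threshold are illustrative.

```python
# Pilot on a labelled subsample before scaling up. classify_image is a
# placeholder for the AI tool under evaluation; labels and threshold are illustrative.
from pathlib import Path

def classify_image(path: Path) -> str:
    raise NotImplementedError  # replace with your actual model or API call

ground_truth = {  # image file -> known label from prior labelling or expert review
    "img_001.png": "parasitized",
    "img_002.png": "uninfected",
}

MIN_ACCEPTABLE_ACCURACY = 0.95  # set this for your own research context

results = {name: classify_image(Path(name)) == label
           for name, label in ground_truth.items()}
accuracy = sum(results.values()) / len(results)

print(f"Subsample accuracy: {accuracy:.1%}")
if accuracy < MIN_ACCEPTABLE_ACCURACY:
    print("Do not scale up; investigate the failure cases first:")
    for name, ok in results.items():
        if not ok:
            print(" -", name)
```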

📚 Core Readings

Wang et al. (2024)

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

NeurIPS 2024, Princeton University. The definitive benchmark study for AI chart understanding on real scientific figures. Establishes the 47% finding, provides a root-cause analysis, and includes a human baseline. Essential reading before using any AI tool with scientific charts.

  • 2,323 real scientific charts from arXiv papers
  • Descriptive vs. reasoning question analysis
  • Human performance baseline

arxiv.org/abs/2406.18521 ↗

Rahmanzadehgervi et al. (2024)

Vision Language Models Are Blind

ACCV 2024. Tests models on trivially simple geometric tasks — overlapping circles, intersecting lines. Average model score: 58.07%. Provides the mechanistic explanation for the CharXiv result: models are reading text labels, not processing image geometry. A short, accessible paper with significant implications.

  • Seven simple geometric task categories
  • Four state-of-the-art VLMs evaluated
  • Clear failure mode taxonomy

arxiv.org/abs/2407.06581 ↗

Jin et al. (2024)

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

npj Digital Medicine, July 2024. GPT-4V achieves 81.6% on NEJM Image Challenges vs. 77.8% for physicians — but in 35.5% of correct answers, the reasoning was demonstrably flawed, with an image comprehension error rate of approximately 27%. Documents the correct-answer-wrong-reasoning problem in a rigorous clinical evaluation. The implications extend well beyond medicine.

  • NEJM Image Challenge evaluation dataset
  • Detailed reasoning error analysis
  • Comparison with expert physician performance

nature.com ↗

✅ Summary and What's Next

Key Takeaways from Sub-Lesson 2

The 47% finding (CharXiv, NeurIPS 2024) is the central result: on real scientific charts from actual published papers, the best general-purpose model at publication achieved 47.1% on reasoning questions vs. 80.5% for humans. This is not a marginal performance gap — it means you cannot trust AI to reliably read specific information from scientific figures. The root cause is that models are reading text labels and captions, not processing chart geometry.

The correct-answer-wrong-reasoning problem (Jin et al., npj Digital Medicine, 2024) adds a second layer: even when the model's answer is correct, the visual reasoning behind it may be fabricated. Verification requires reading the explanation, not just checking the answer.

Domain-specific vs. general-purpose tools: For precision classification tasks in medical imaging, satellite analysis, or microscopy, domain-trained models substantially outperform general-purpose LLMs. Use general-purpose LLMs for description and documentation; use domain tools for measurement and classification.

Bias: NIST FRVT demographic evaluations show that face-recognition systems still produce false-positive rates 10–100× higher and false-negative rates up to ~3× higher for African and East Asian subjects than for White subjects in 1:1 verification — though differentials vary substantially by algorithm. For South African researchers, this is a direct methodological concern that requires stratified accuracy testing on your own dataset.

The practical workflow (five steps above) operationalises these findings: stay in the description zone, verify all numbers independently, read the reasoning not just the answer, use domain tools for precision tasks, and test on a labelled subsample before scaling.

Next: Sub-Lesson 3 covers documents and tables — where many of the same principles apply but with additional complexity from structured data extraction, OCR, and PDF table parsing. The reading-vs-understanding gap shows up there too, with different failure modes.